This specification covers the CDXJ file format used by OpenWayback 3.0.0 (and later) to index web archive contents (notably in WARC and ARC files) and make them searchable via a resource resolution service.
The format builds on the CDX file format originally developed by the Internet Archive for the indexing behind the Wayback Machine. This specification simplifies the primary fields and adds a flexible JSON ‘block’ to each record, allowing high flexibility in the inclusion of additional data.
The use of JSON in this manner is not novel. This specification focuses on enumerating the exact fields outside the JSON and on listing the most common JSON fields for cross-compatibility.
For the purposes of this document, CDXJ will be understood to refer to this particular specification.
While we recognize that this format may have wider application (e.g. for data exchange), the primary purpose of this specification is to establish a file format suitable for creating a simple, URI-keyed index to the resources of a web archive.
Each file is a plain text file, UTF-8 encoded. Each line should end with a Unix-style newline character (0x0A).
A CDXJ file that has been sorted can be referred to as a CDXJ index, as it is easily searchable.
Each file should begin with a line declaring the file format and file format version. This line is preceded by a bang symbol (! - 0x21) so that it naturally sorts to the front of the file.
Example:
!OpenWayback-CDXJ 1.0
This line may be repeated any number of times, as long as the repetitions are all sequential, starting from the first line of the file. This is to accommodate the merging of multiple CDXJ files that may be generated at different times.
It is permissible to mix minor version numbers (e.g. 1.0 and 1.1) in the same file, as minor versions are required to be backwards compatible. In this scenario, parsing software should treat the entire file based on the highest observed version number.
It is not permissible to mix major version numbers (e.g. 1.0 and 2.0) in the same file. It is understood that an increase in the major version number indicates a change that is not backwards compatible. It is not possible to merge CDXJ files with different major version numbers.
Lines beginning with the bang symbol may only appear at the top of the file, regardless of whether the file has been sorted.
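The header rules above can be sketched as a small validator. This is only an illustration, not part of the specification: the function name `read_header_version` and the choice to reject a file without any header line are assumptions made for the example.

```python
import re

# Matches a header line such as "!OpenWayback-CDXJ 1.0".
HEADER_RE = re.compile(r"^!OpenWayback-CDXJ (\d+)\.(\d+)$")


def read_header_version(lines):
    """Return (major, minor, body_start) for a list of CDXJ lines.

    body_start is the index of the first record line. Hypothetical helper;
    treating a missing header as an error is an assumption of this sketch.
    """
    versions = []
    body_start = 0
    for line in lines:
        if not line.startswith("!"):
            break
        match = HEADER_RE.match(line.rstrip("\n"))
        if match is None:
            raise ValueError(f"unrecognized header line: {line!r}")
        versions.append((int(match.group(1)), int(match.group(2))))
        body_start += 1
    if not versions:
        raise ValueError("file does not start with a header line")
    # Major versions may not be mixed in one file.
    if len({major for major, _ in versions}) > 1:
        raise ValueError("mixed major versions in one file")
    # Bang lines may only appear at the top of the file.
    for line in lines[body_start:]:
        if line.startswith("!"):
            raise ValueError("header line found after the first record")
    # Mixed minor versions: treat the file as the highest one observed.
    major, minor = max(versions)
    return major, minor, body_start
```

A file whose first two lines declare 1.0 and 1.1 would be treated as version 1.1 throughout, while a file declaring both 1.0 and 2.0 would be rejected.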
Following the header lines, each additional line should represent exactly one resource in a web archive. Typically the resource is stored in a WARC (ISO 28500) or ARC file, although the exact storage of the resource is not defined by this specification. Each such line shall be referred to as a record.
Each line, or record, is composed of four (4) fields.
The fields are separated by spaces (0x20). Consequently, spaces may not appear in the fields, except for the last field (JSON block).
Additionally, only the last (JSON block) field may begin with an opening curly brace ({ - 0x7B).
The first three fields are collectively known as the sortable fields.
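Because spaces are only allowed inside the last field, a record can be split on its first three spaces. The following is a minimal sketch of that idea; `split_record` is a hypothetical helper name, and the JSON key `note_x` in the usage line is made up for illustration.

```python
def split_record(line):
    """Split one CDXJ record into its four fields (hypothetical helper)."""
    line = line.rstrip("\n")
    # Split on the first three spaces only; the JSON block may contain spaces.
    parts = line.split(" ", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    key, timestamp, record_type, json_block = parts
    if key.startswith("!"):
        raise ValueError("only header lines may begin with '!'")
    # Only the JSON block may begin with an opening curly brace.
    for field in (key, timestamp, record_type):
        if field.startswith("{"):
            raise ValueError(f"sortable field begins with '{{': {field!r}")
    return key, timestamp, record_type, json_block


fields = split_record('(com,example,)/ 2007-08-24T12:33:53Z response {"note_x": "b c"}\n')
```

The JSON block is returned as a string here; decoding it is left to the caller so that lookups on the sortable fields do not pay for JSON parsing.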
The first field is a searchable version of the URI that this resource refers to.
By searchable, we mean that the following transformations have been applied to it:
Note: While this is extremely similar to other CDX and CDXJ implementations, we note that many of them get the SURT format wrong, most notably by omitting the starting parenthesis or dropping the trailing comma in the domain name.
E.g. using OpenWayback 2’s CDX server, the URL ‘http://example.com/’ would translate to:
com,example)/
The correct SURT transformation is …
(com,example,)/
… once you include the third step of dropping the scheme.
The first field may not begin with a bang character (! - 0x21). Only header lines may begin with this character.
For URIs containing an internationalized domain name (IDN), this field should always use the IDN format and not the Punycode representation.
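As an illustration of the IDN rule, Punycode labels can be converted back to their Unicode form before the key is built. The sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules; whether that matches the behaviour of any given OpenWayback version is an assumption, and `to_unicode_host` is a hypothetical name.

```python
def to_unicode_host(host):
    """Convert Punycode ('xn--') labels in a host name to Unicode."""
    labels = []
    for label in host.lower().split("."):
        if label.startswith("xn--"):
            # The built-in 'idna' codec performs the ToUnicode operation.
            label = label.encode("ascii").decode("idna")
        labels.append(label)
    return ".".join(labels)


host = to_unicode_host("xn--bcher-kva.example")  # "bücher.example"
```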
The second field is a timestamp. It should correspond to the WARC-Date timestamp as of WARC 1.1.
A UTC timestamp as described in the W3C profile of ISO8601 [W3CDTF]. The timestamp shall represent the instant that data capture for record creation began. […]
WARC-Date may be specified at any of the six levels of granularity described in [W3CDTF]. If WARC-Date includes a decimal fraction of a second, the decimal fraction of a second shall have a minimum of 1 digit and a maximum of 9 digits. WARC-Date should be specified with as much precision as is accurately known.
This is a notable departure from the original CDX format, which used a somewhat truncated timestamp (YYYYMMDDhhmmss). The level of accuracy of the timestamp should match the accuracy that is available in the WARC (or other source material).
Note: All timestamps should be in UTC.
In general, this field is equivalent to the WARC-Date field of a WARC record.
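When converting an existing CDX index, the 14-digit timestamps need to be rewritten in the W3CDTF form. A minimal sketch, assuming the legacy timestamp is already in UTC and has exactly second precision (`cdx14_to_w3cdtf` is a hypothetical name):

```python
from datetime import datetime


def cdx14_to_w3cdtf(stamp):
    """Convert a legacy YYYYMMDDhhmmss timestamp to W3CDTF in UTC."""
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    # The trailing 'Z' marks the value as UTC.
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


converted = cdx14_to_w3cdtf("20070824123353")  # "2007-08-24T12:33:53Z"
```

When the source is a WARC file, copying the WARC-Date value as-is preserves whatever precision the record carries, so no conversion of this kind is needed.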
The third field indicates what type of record the current line refers to. This field is fully compatible with the WARC 1.0 definition of WARC-Type (chapter 5.5 and chapter 6).
For content not stored in WARCs, a reasonable equivalent should be chosen.
E.g.
The fourth and final field is a single-line JSON block. This should contain fully valid JSON data. The only limitation, beyond those imposed by JSON encoding rules, is that this may not contain any newline characters, either in Unix (0x0A) or Windows form (0x0D0A). The first occurrence of a 0x0A constitutes the end of this field (and the record).
The order of key/value pairs within the JSON block is unspecified. It is not expected that records can be sorted, in any way, on the JSON block field.
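Writing a record then amounts to joining the three sortable fields and a compact JSON serialization. The sketch below relies on `json.dumps` without an `indent` argument, which emits a single line and escapes any newline inside a string value; `format_record` and the key `note_x` are hypothetical names used only for this example.

```python
import json


def format_record(key, timestamp, record_type, data):
    """Serialize one CDXJ record as a single line (hypothetical helper)."""
    # Without 'indent', json.dumps emits no raw newlines; newlines inside
    # string values are escaped as the two characters backslash and 'n'.
    block = json.dumps(data, ensure_ascii=False, separators=(",", ":"))
    if "\n" in block or "\r" in block:
        raise ValueError("JSON block must not contain newline characters")
    return f"{key} {timestamp} {record_type} {block}\n"


line = format_record(
    "(com,example,)/",
    "2007-08-24T12:33:53Z",
    "response",
    {"note_x": "line1\nline2"},
)
```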
It is legal to store any amount of data in the JSON block. The following keys, however, have a defined meaning. Further, some of these fields are required.
Defined JSON keys:
sha-1) is not included in this field. See dig for alternative hashing algorithms.
Additionally, the three sortable fields can be redundantly stored in the JSON block, if so desired, using the following keys:
To reduce the possibility of incompatibility with future versions of CDXJ, the use of custom keys longer than 3 characters is recommended.
Alternatively, new keys may be discussed and agreed upon by the OpenWayback community prior to a revised CDXJ version being released, allowing for quicker implementation of additional keys.
A sorted CDXJ file is considered a CDXJ index, allowing lookup on URI (and timestamp) using simple binary searches. The CDXJ structure is designed to facilitate this. A non-sorted CDXJ file is still valid as far as this specification goes. It is just not usable for searching.
When sorted, the sorting should be done based on the native byte values of the characters. This is equivalent to using GNU sort with the collation setting LC_ALL=C.
The structure of the CDXJ is designed so that common sorting tools (e.g. the GNU sort utility) work as expected, provided that the correct collation settings are set as described above.
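The same ordering can be reproduced in code by comparing lines as raw bytes, after which a binary search finds all captures of a given key. This is a sketch under the assumption that the index fits in memory as a list of byte strings without trailing newlines; `sort_records` and `lookup` are hypothetical helpers.

```python
import bisect


def sort_records(records):
    """Sort lines (bytes, no trailing newline) by raw byte value."""
    # Python compares bytes objects byte by byte, like sort under LC_ALL=C.
    return sorted(records)


def lookup(index, key):
    """Return all records in a sorted index whose first field equals key."""
    prefix = key + b" "
    # Binary search for the first line that is >= "key ".
    start = bisect.bisect_left(index, prefix)
    matches = []
    for line in index[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches


index = sort_records([
    b"(com,example,)/b 2007-08-24T12:33:53Z response {}",
    b"(com,example,)/ 2008-01-01T00:00:00Z response {}",
    b"!OpenWayback-CDXJ 1.0",
    b"(com,example,)/ 2007-08-24T12:33:53Z response {}",
])
hits = lookup(index, b"(com,example,)/")
```

Because the timestamp is the second field, the matches for one key come back in chronological order, and the header line sorts ahead of every record.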
Canonicalization is applied to URIs to remove trivial differences that do not indicate that the URIs reference different resources. Examples include removing session ID parameters and unnecessary port declarations (e.g. :80 when crawling HTTP).
OpenWayback implements its own canonicalization process. Typically, it will be applied to the searchable URIs in CDXJ files. You can, however, use any canonicalization scheme you care for (including none). You must simply ensure that the same canonicalization process is applied to the URIs when performing searches. Otherwise they may not match correctly.
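To make the idea concrete, here is a deliberately small canonicalizer covering only the two examples mentioned above. It is not OpenWayback's canonicalization process; the function name, the list of session parameter names, and the decision to drop user information are all assumptions of this sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Hypothetical list of query parameters treated as session IDs.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid"}


def canonicalize(uri):
    """Drop default ports and session ID parameters (illustrative only)."""
    parts = urlsplit(uri)
    host = parts.hostname or ""  # lowercased; user info is dropped
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), parts.fragment))


canonical = canonicalize("http://Example.com:80/a?x=1&PHPSESSID=abc")
```

Whatever scheme is chosen, the same function has to run both when the index is built and when a lookup key is prepared.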
SURT is a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names.
A URI <scheme://domain.tld/path?query> has SURT form <scheme://(tld,domain,)/path?query>.
Conversion to SURT form also involves making all characters lowercase, and changing the ‘https’ scheme to ‘http’. Further, the ‘/’ after a URI authority component – for example, the third slash in a regular HTTP URI – will only appear in the SURT form if it appeared in the plain URI form.
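The transformation can be sketched as follows. The example assumes a plain host without port or user information and ignores any fragment; `to_surt` and `to_search_key` are hypothetical names, and the second function applies the scheme-dropping step used for the first field of a record.

```python
from urllib.parse import urlsplit


def to_surt(uri):
    """Return the SURT form of a URI (sketch: no port, user info or fragment)."""
    parts = urlsplit(uri.lower())
    scheme = "http" if parts.scheme == "https" else parts.scheme
    host = parts.hostname or ""
    # Reverse the domain labels and keep the trailing comma.
    labels = ",".join(reversed(host.split(".")))
    surt = f"{scheme}://({labels},)"
    # The path is kept exactly as given, so a missing '/' stays missing.
    surt += parts.path
    if parts.query:
        surt += "?" + parts.query
    return surt


def to_search_key(uri):
    """Return the SURT form with the scheme dropped."""
    return to_surt(uri).split("://", 1)[1]


key = to_search_key("http://example.com/")  # "(com,example,)/"
```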
The ‘warcfile’ URI scheme shall be assumed to have the following structure:
warcfile:<warc-filename>#<offset>
The <warc-filename> is the full filename of the WARC (including any suffixes) but excluding any path information. WARC filenames are assumed to be unique.
The <offset> is the number of bytes from the start of the WARC where the relevant record begins.
Example URI:
warcfile:IAH-20070824123353-00393-heritrix2.nb.no.arc.gz#25523382
To fully resolve a URI with this scheme, an index mapping WARC filenames to specific locations is needed.
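Parsing and resolving such a URI might look like the sketch below. The location index is modelled as a plain dictionary from WARC filename to directory, which is an assumption; the specification does not define how that mapping is stored. Both function names and the directory `/data/warcs` are hypothetical.

```python
def parse_warcfile_uri(uri):
    """Split a warcfile URI into (filename, offset)."""
    prefix = "warcfile:"
    if not uri.startswith(prefix):
        raise ValueError(f"not a warcfile URI: {uri!r}")
    # The offset follows the last '#'.
    filename, separator, offset = uri[len(prefix):].rpartition("#")
    if not separator or not filename or not offset.isdigit():
        raise ValueError(f"malformed warcfile URI: {uri!r}")
    return filename, int(offset)


def resolve(uri, locations):
    """Map a warcfile URI to (full path, offset) using a filename index."""
    filename, offset = parse_warcfile_uri(uri)
    return f"{locations[filename]}/{filename}", offset


name = "IAH-20070824123353-00393-heritrix2.nb.no.arc.gz"
resolved = resolve(f"warcfile:{name}#25523382", {name: "/data/warcs"})
```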
!OpenWayback-CDXJ 1.0